
Optimizing knowledge distillation models for language models

Abstract

The problem of optimizing large neural networks is discussed using language models as an example. The size of large language models is an obstacle to their practical application in settings with limited computing resources and memory. One actively developed direction for compressing large neural network models is knowledge distillation: the transfer of knowledge from a large teacher model to a smaller student model without a significant loss of accuracy. Existing knowledge distillation methods have notable drawbacks: inaccurate knowledge transfer, a long training process, and error accumulation on long sequences. Two methods that improve the quality of knowledge distillation for language models are proposed: selective teacher intervention in the student’s training process and low-rank adaptation. The first approach passes the teacher’s tokens to the student during training at the neural network layers for which an exponentially decreasing threshold on the discrepancy between the teacher’s and the student’s probability distributions is reached. The second approach reduces the number of parameters in the neural network by replacing fully connected layers with low-rank ones, which lowers the risk of overfitting and speeds up training. The limitations of each method on long sequences are demonstrated. Combining the two methods is proposed to obtain an improved classical knowledge distillation model for long sequences. The combined approach to knowledge distillation on long sequences made it possible to compress the resulting model significantly with only a slight loss of quality, while also substantially reducing GPU memory consumption and inference time. The complementary approaches to optimizing knowledge transfer and compressing the model showed better results than selective teacher intervention in the student training process and low-rank adaptation applied separately. On long sequences, the improved classical knowledge distillation model reached 97 % of the answer quality of full fine-tuning and 98 % of the quality of the low-rank adaptation method in terms of ROUGE-L and perplexity, while the number of trainable parameters was reduced by 99 % compared to full fine-tuning and by 49 % compared to low-rank adaptation. In addition, GPU memory usage was reduced by 75 % and 30 %, respectively, and inference time by 30 %. The proposed combination of knowledge distillation methods can be applied in problems with limited computational resources.
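A minimal sketch of the first idea, selective teacher intervention, under illustrative assumptions (the function names, the base threshold `tau0`, and the decay rate `decay` are not from the paper): the teacher’s token replaces the student’s own prediction wherever the teacher/student KL divergence exceeds a threshold that decreases exponentially over training steps.

```python
# Sketch only: selective teacher intervention with an exponentially
# decaying KL-divergence threshold. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F


def intervention_threshold(step: int, tau0: float = 1.0, decay: float = 0.999) -> float:
    """Exponentially decreasing threshold: tolerant early in training, strict later."""
    return tau0 * (decay ** step)


def select_next_tokens(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       step: int) -> torch.Tensor:
    """Return next-token ids, taking the teacher's token at positions where
    KL(teacher || student) exceeds the current threshold."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # Per-position KL divergence between the teacher and student distributions.
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(dim=-1)
    intervene = kl > intervention_threshold(step)
    student_tokens = student_logits.argmax(dim=-1)
    teacher_tokens = teacher_logits.argmax(dim=-1)
    return torch.where(intervene, teacher_tokens, student_tokens)
```

Because the threshold shrinks over time, the teacher intervenes often at the start of training and progressively hands control back to the student.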
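The second idea, low-rank adaptation, can be sketched as follows (again an illustrative assumption, not the authors’ exact implementation): a frozen fully connected layer is augmented with a trainable low-rank correction, so only r·(d_in + d_out) parameters are trained instead of d_in·d_out.

```python
# Sketch only: wrapping a fully connected layer with a trainable low-rank update.
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Frozen full-rank linear layer plus a trainable low-rank correction B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the original weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen full-rank path plus the trainable low-rank path.
        return self.base(x) + x @ self.A.t() @ self.B.t()
```

Initializing the up-projection at zero keeps the wrapped layer identical to the original at the start of training, so only the small low-rank factors need to adapt during distillation.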

Keywords
